ABSTRACT



Two forms of a mathematics achievement test were 
developed for use in the evaluation of an experimental instructional 
program at the seventh grade level. Subjects were drawn primarily 
from minority groups at three urban junior high schools; all were of 
approximately normal scholastic aptitude, but one year or more behind 
their peers in mathematics achievement. Eligible students were 
assigned to experimental or regular mathematics classes and tested at 
the beginning and end of the school year. After the posttesting,
teachers were asked to rate all items on the two test forms for the 
extent to which their instruction would be likely to facilitate the 
ability of their students to answer each correctly. The findings 
imply that both groups of teachers tended to direct instruction at 
skills which were relatively well-developed at entry rather than at 
areas in which students were initially weak. Possible reasons why 
this might occur are examined and potentially important implications 
of these findings for instructional practice and evaluation 
methodology are discussed, especially with regard to instructional 
programs for educationally deprived students. In particular, the 
importance of mapping student entry skills before designing 
instructional programs is stressed, along with the role of evaluation 
in providing such initial feedback information. (Author/AG) 
STUDENT ENTRY SKILLS AND THE EVALUATION
OF INSTRUCTIONAL PROGRAMS: A CASE STUDY 



It is axiomatic among curriculum 
experts that teachers often fail to 
acquaint themselves with the patterns 
of skills students bring initially to 
their classrooms. When new instruc- 
tional programs are being developed 
for later use by large numbers of 
teachers, such failure to monitor en- 
try skills can result in a grossly in- 
adequate match between the learning 
needs of students and the content of 
instruction. The present research 
goes further, however, and suggests
that under certain circumstances 
there may even be a tendency for 
teachers to emphasize skills already 
mastered by their students. This 
paper presents empirical evidence for 
such an assertion and attempts to 
deduce some of the reasons why teach- 
ers under conditions similar to those 
encountered in this research might 
direct instruction at the improvement 
of skills already attained by a major- 
ity of their students. 

Students and Instructional Programs

Findings reported in this re- 
search are based on data collected as 
part of the evaluation of a curricu- 
lum development program in seventh 
grade mathematics. For the analy- 
ses reported here data were avail- 
able on 488 students in three junior 
high schools. Of these, 285 were as- 
signed to experimental classes taking 
the curriculum under development, and 
203 were assigned to comparison clas- 
ses providing the regular mathematics 
curriculum for each school. Within
each school, experimental and comparison
groups were not in every
case equivalent at the beginning of 
instruction, but this fact has no 
bearing on the issues discussed here. 

The three schools were located in a 
metropolitan context, and in two 
cases were of an "inner" city type. 
Students were varied in racial and 
ethnic characteristics. As identi- 
fied below, approximately 95% of the 
students in School 1 were Mexican- 
Americans, with an approximately 
equal percentage of the students in 
School 2 being Negro. School 3 was 
of a more mixed ethnic character, 
with somewhat over 60% of the students
Caucasian, about 30% Mexican-American,
and a small percentage of
Negro and Oriental students. The 
two sexes were approximately equally 
represented in all groups. 

The majority of the students in 
the three schools had taken the 
California Test of Mental Maturity 
(1957 Short Form) at the end of the 
fifth grade. Mean total I.Q. scores
at each school for experimental and 
comparison groups were respectively: 
School 1, 94 and 90; School 2, 90 for 
both groups; and School 3, 95 and 100. 
All students assigned by the schools 
to either experimental or comparison 
classes were at least one year behind 
in mathematics achievement for their 
own school. In general, then, students 
participating in this research can be 
described as members of urban minority 
groups who have shown unsatisfactory 
achievement in mathematics in compari- 
son with their peers. Their academic 
performance probably cannot be accounted 
for by low scholastic aptitude, since
means for all groups are within the 
normal range. 

Teachers in the experimental 
program were volunteers selected for 
their high professional qualifica- 
tions. With one exception, all ex- 
perimental teachers were from other 
schools and were on temporary assign- 
ment to the program. Teachers of 
comparison classes were on the reg- 
ular staff of the participating 
schools. Thirteen experimental and 
thirteen comparison teachers contri- 
buted data to the present research. 

The experimental teachers spent 
only half of their time in the class- 
room, using the rest of the day for 
program development activities. In 
general, the experimental programs 
at the three schools utilized pro- 
grammed materials, games and modern 
media in the presentation of content 
that was to some extent oriented to 
the "modern" math. While such elements
were by no means excluded from
the comparison mathematics classes, 
the latter were less richly supplied 
with materials and in most cases
placed more emphasis on the development
of computational skills. The
experimental programs developed at 
the three schools were by no means 
identical, and all data presented 
below are broken down by group with- 
in school. 

Testing 

Students in both experimental 
and comparison programs were admin- 
istered a variety of tests and other 
measures, though only the Diagnostic 
Test is of interest here. This in- 
strument was constructed for the 
purpose of comparing the achievement
of students in the various groups and
was made up of items judged to be perti- 
nent to the instructional goals of the 
experimental program. The possibly dif- 
ferent instructional goals of the compari- 
son program were not specifically taken 
into consideration except for the inclusion 
of a subset of computational problems in ad- 
dition, subtraction, multiplication, and 
division. The rest of the items provided 
a more heterogeneous array of combina- 
tions of content and processes than would 
be found in the typical standardized 
achievement test, since one of the inten- 
tions of the research was to compare the 
two instructional programs on a variety 
of individual items or subgroups of items. 
Two forms of the test were developed by 
randomly assigning the members of a pair 
of items of each type to Form A or Form 
B of the test. Students took the same 
form of the Diagnostic Test at the begin- 
ning and end of the school year, with forms 
randomly assigned to classes within each 
program at each school. The proportion of
students passing each item on each form was 
calculated for pre- and posttest data for 
each combination of school, form of the 
test, and instructional program (e.g.,
School 1, Form A, experimental program). 
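
The tabulation described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the function name `proportion_passing`, and the sample records are assumptions made for the example and are not taken from the study's data.

```python
from collections import defaultdict

def proportion_passing(responses):
    """Return the proportion of students passing each item, keyed by
    (school, form, program, item).

    `responses` is an iterable of (school, form, program, item, passed)
    tuples, one per student per item (a hypothetical layout).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for school, form, program, item, ok in responses:
        key = (school, form, program, item)
        total[key] += 1
        if ok:
            passed[key] += 1
    return {key: passed[key] / total[key] for key in total}

# Hypothetical pretest records, invented purely for illustration.
records = [
    (1, "A", "experimental", 1, True),
    (1, "A", "experimental", 1, False),
    (1, "A", "experimental", 1, True),
    (1, "A", "experimental", 1, True),
    (1, "A", "experimental", 2, False),
    (1, "A", "experimental", 2, False),
    (1, "A", "comparison", 1, True),
    (1, "A", "comparison", 1, False),
]

p = proportion_passing(records)
for key in sorted(p):
    print(key, p[key])
```

Running the same function separately on pretest and posttest records would give the two sets of item proportions for each subgroup.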

Ratings of Relevancy 

After the posttesting, all teachers 
in the two programs were given for the 
first time copies of the two forms of the 
Diagnostic Test and asked to judge the 
relevancy of each item to instruction in 
their classes during the school year. 
Specifically, the 13 cooperating teachers 
in each group were instructed to: 

Make a judgment on the extent to 
which instruction in your mathe- 
matics classes this year would 
facilitate students' ability to
answer each item correctly.
The teachers were instructed to use
the following 5-point rating scale: 

1. Definitely would not fa- 
cilitate ability to answer 

2. Probably would not facil- 
itate 

3. Uncertain 

4. Probably would facilitate 

5. Definitely would facil- 
itate ability to answer 

Mean ratings for teachers in each 
combination of test form, school, 
and instructional program were 
calculated for each item. 



Purpose 

Initially, the relevancy rat- 
ings were collected to provide a 
check on the fairness or appropri- 
ateness of the items included in 
the Diagnostic Test. It was hoped
that all or nearly all items would 
be judged to be closely related to 
the content of instruction, espe- 
cially in the experimental group. 

It was also of interest to deter- 
mine whether the comparison teach- 
ers would see the items as less 
relevant than did the experimental 
teachers, as might be expected in 
view of the fact that their in- 
structional goals were not specific 
criteria used in selecting the 
items.

In examining the data, however, 
the author chanced to see an initial 
item difficulty index for one of the 
subgroups in juxtaposition with the 
mean rating of the item by the teach- 
ers of those particular students. 

All of the students in the group had 
passed the item at pretest, yet all
of the teachers of those students had
rated the item 5, implying a definite 
relevancy to instructional content! 

This startling observation led to in- 
vestigation of the overall relationship 
between initial item difficulty and 
teacher ratings of item relevancy. For
this purpose, correlations between pro- 
portion of students passing each item 
at pretest and mean rating of item rele- 
vancy were calculated for each subgroup 
of students. Ideally, such correlations 
should be negative, indicating that 
teachers place greater emphasis on those 
skills in which students are initially 
weak. Correlations of approximately zero
would suggest the lack of any systematic 
relationship between entry skills and 
instructional content, seemingly an
undesirable situation. Positive correlations,
of course, would be even less desirable, 
since such findings would suggest that 
instructional content is oriented to stu- 
dent strengths rather than weaknesses. 
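
As a rough illustration of this analysis, the sketch below correlates pretest pass proportions with mean relevancy ratings for a handful of items. The data are hypothetical and chosen only to show what a positive coefficient looks like. The product-moment (Pearson) formula is also an assumption, since the text does not name the coefficient used.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical subgroup: five items, each rated by three teachers on the
# 5-point relevancy scale. None of these numbers come from the study.
pretest_pass = [0.90, 0.75, 0.50, 0.30, 0.10]  # proportion passing at pretest
teacher_ratings = [
    [5, 5, 4],
    [4, 5, 4],
    [3, 4, 3],
    [2, 3, 3],
    [1, 2, 2],
]

# Mean rating per item, then the correlation across items.
mean_ratings = [mean(item) for item in teacher_ratings]
r = pearson_r(pretest_pass, mean_ratings)
print(round(r, 2))
```

With these invented data the coefficient is strongly positive: the items most students already pass are also the items rated most relevant. A negative coefficient would indicate the opposite pattern, with instruction aimed at initial weaknesses.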



Results

As indicated above, the data col- 
lected in this research were analyzed 
for the purpose of determining the na- 
ture of the relationship between the 
entry skills of students and the in- 
structional objectives of their teach- 
ers. Before presenting evidence 
relating to this primary issue, two 
preliminary questions need to be 
dealt with by way of anticipating pos- 
sible alternative interpretations of 
the results.

(1) Did the teachers in both groups 
judge items on the Diagnostic Test to 
be in general relevant to their
instructional objectives, and were there
differences in the ratings of
experimental and comparison teachers? Mean
ratings averaged over the 52 items in 
each form of the pretest are reported
in Table 1. These ratings were also 
identified according to school and 
experimental vs. comparison teachers, 
and they show that as a whole the test 
was judged relevant to instructional 
goals as perceived by the teachers 
themselves. All but one of the means 
are above 3.0, the point of uncertain- 
ty. Perusal of the data on individual 
items for each of the twelve subgroups 
shown below did show variability for 
all groups in ratings across items, 
but also revealed many items with mean 
ratings at or close to 5.0 for a given 
group.

It was anticipated that the ex- 
perimental teachers would judge the 
test to be more relevant, since their 
planned instructional goals were the 
major consideration in the selection 
of test items. This expectation was 
not confirmed. The mean ratings in 
Table 1 do not reveal a pattern of 
differences between ratings by ex- 
perimental and comparison teachers. 

In most cases the means are very
close for the two groups within each
school, and the highest mean in the
table was generated by comparison
teachers at School 1.

(2) To what extent did experimental
and comparison teachers agree on the
relevancy of individual items of the
test?

While overall ratings of the rele- 
vancy of items were remarkably sim- 
ilar for the experimental and 
comparison teachers, it does not 
necessarily follow that teachers in 
the two groups saw the same items 
as relevant. Indeed, if all teachers 
gave similar relevancy ratings on 
each item, any differences between 
the two groups in the correlations 
of initial item difficulty with rel- 
evancy ratings could only be ex- 
plained in terms of systematic 
differences between experimental 
and comparison students in the skills 
available at entry. Since there is 
no reason to believe that the process 
of assigning students to groups would 
result in systematic differences in 
patterns of entry skills, this ex- 
planation of the results would lead 
nowhere.

To answer the above question, 
mean ratings by experimental and 
comparison teachers on each item 
were correlated over the 52 items
within school and by test form.

Table 1

Mean Ratings of Relevancy of Diagnostic Test Items*

                   Form A                      Form B
           Experimental  Comparison    Experimental  Comparison

School 1     3.71 (5)     4.28 (3)       3.81 (5)     4.23 (3)
School 2     4.0  (4)     3.94 (8)       3.98 (4)     3.82 (8)
School 3     3.61 (4)     3.59 (2)       3.36 (4)     2.95 (2)

* Numerals in parentheses refer to the number of teachers contributing
to each mean rating. The rating for item relevancy ranged from "1"
(low) to "5" (high).

These correlations are reported 
in Table 2. Inspection of these 
correlations reveals that only 
in the case of School 2 is there 
a relatively high relationship 
between the relevancy ratings of 
experimental and comparison teach- 
ers. In the case of the other two 
schools, the correlations, while 
positive, are quite weak. A pos- 
sible interpretation of this find- 
ing may lie in the report made by 
members of the evaluation staff 
assigned to the schools as peri- 
odic observers that only at School 
2 were either formal or informal 
discussions between experimental 
and comparison teachers about in- 
structional objectives known to 
have occurred. Although the Diag- 
nostic Test appears to be based on 
reasonably appropriate overall con- 
tent for both experimental and com- 
parison classes, it appears that 
different subgroups of items were 
seen as relevant by experimental 
and comparison teachers in two of 
the schools, with moderate positive
relationships at the third school. With
the above in mind the primary question 
posed in this paper can be addressed. 

(3) What relationship pertained
between the entry skills of students
as reflected in initial item
difficulty and ratings by teachers of the
instructional relevancy of those items? 
Correlations between initial item
difficulty and mean relevancy ratings
across the 52 items on each
form are reported in Table 3 for each
of the twelve subgroups. Two trends
are immediately apparent in this table. 
First and most important, all of the 
correlations are positive, confound- 
ing the seemingly reasonable expecta- 
tion that the signs of the coefficients 
would be negative. Table 3 reveals 
very clearly that the larger the pro- 
portion of students able to answer each 
item correctly at the beginning of the 
year, the more likely were teachers to 
rate that item as highly relevant to 
their instruction. In short, by their 
own reports, the teachers appeared to 
have selected instructional objectives 
that to a considerable extent reflected 
skills already available to their
students.



Table 2

Correlations Between Mean Relevancy Ratings on Individual
Test Items for Experimental and Comparison Teachers

           School 1   School 2   School 3

Form A       .35        .62        .20
Form B       .12        .64        .18

Table 3

Correlations Between Proportion Answering Each
Item of Diagnostic Test Correctly at Pretest
and Teacher Ratings of Item Relevancy

                   Form A                      Form B
           Experimental  Comparison    Experimental  Comparison

School 1       .25          .14            .62          .30
School 2       .44          .27            .58          .32
School 3       .53          .10            .62          .25



The second trend is also sur- 
prising. Without exception corre- 
lations are higher for experimental 
groups than for corresponding com- 
parison groups for each combination 
of school and test form. For Form 
A the average r for experimental
groups across schools is .42 as com- 
pared to .17 for comparison classes. 
For Form B the average experimental 
group correlation is .61 as against 
.29 for comparison students. There
thus appears to have been a greater 
tendency among experimental teachers 
to gear instruction to skills already 
achieved by students at entry into 
the program. 
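
The averages quoted above can be checked against the Table 3 coefficients with a short calculation. The sketch assumes a simple unweighted mean across the three schools. The text does not state how the averages were formed, so this is a check under that assumption and not a reconstruction of the original computation.

```python
from statistics import mean

# Table 3 coefficients, listed in order for Schools 1, 2, and 3.
table3 = {
    ("A", "experimental"): [0.25, 0.44, 0.53],
    ("A", "comparison"):   [0.14, 0.27, 0.10],
    ("B", "experimental"): [0.62, 0.58, 0.62],
    ("B", "comparison"):   [0.30, 0.32, 0.25],
}

# Unweighted mean across schools, rounded to two places.
averages = {key: round(mean(values), 2) for key, values in table3.items()}

# Does the experimental coefficient exceed the comparison coefficient
# for every combination of school and test form?
all_higher = all(
    e > c
    for form in ("A", "B")
    for e, c in zip(table3[(form, "experimental")], table3[(form, "comparison")])
)

for key, value in averages.items():
    print(key, value)
print("experimental higher in every case:", all_higher)
```

Under this assumption the Form A comparison average and both Form B averages agree with the figures in the text. The Form A experimental average comes to about .41 against the .42 quoted, a small gap that may reflect rounding or a different averaging method.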

It may also be noted, incident- 
ally, that the correlations in Table 
3 are invariably higher for Form B of 
the test. This trend can probably be 
ignored, since the author neglected to 
control for order effects when the 
ratings were collected, with the result
that the items on Form A were always
rated first. The correlations for 
Form B are perhaps more accurate esti- 
mates in the sense that the judges were 
more practiced. 

Discussion 

How are these results, so incon- 
sistent with what seems to be a rea- 
sonable expectation, to be explained, 
and what are their implications for 
the development and evaluation of in- 
structional programs? We are, of 
course, dealing here with correlational 
research designed to identify relation- 
ships existing in the data rather than 
to explain, as would be the case under 
experimental conditions, the origin of 
relationships. For this reason, and 
because of the possible importance of 
these findings with regard to educa- 
tional practice, several alternative 
explanations need to be considered. 
A first explanation deserving of
consideration holds that the results 
reported above on the relationship be- 
tween entry skills and relevancy rat- 
ings are spurious in the sense that 
the teachers could actually have
emphasized different content in the
classroom than was indicated by their 
ratings. That is, perhaps the ratings 
did not reflect what the teachers ac- 
tually did, but rather the opposite 
or at least something quite different.
Admittedly, the motivation for such 
behavior is difficult to construe, but 
reasoning along the following lines 
does not appear unduly contrived. We
can safely assume that teachers are 
sensitive about evaluations others 
make of their performance as reflec- 
ted in the achievement of their stu- 
dents. Moreover, teachers have 
sufficient opportunity during the
year to become aware of the patterns
of subject matter skills available 
to their students. Given this com- 
bination of desire to "look good" 
and knowledge of what students can 
and cannot do, it would be easy to 
claim credit via the relevancy
ratings for teaching students what they
already knew in the first place.

While such an uncharitable in- 
terpretation of the results cannot 
be completely discounted, it does 
seem improbable for at least two rea- 
sons. First, there is no positive 
evidence for the assertion that 
teachers were either consciously 
or unconsciously distorting the actual
situation in their ratings.
On the contrary, there is some evidence,
partly formal and partly informal,
that the ratings were honest
reflections of instructional content.
In another report derived
from this same research, Patalino
(1968) found frequent instances
in which greater than average gains from
pretest to posttest were accompanied 
by higher than average relevancy ratings 
for subgroups of items examined separa- 
tely, suggesting that more emphasis was 
placed on skills rated highly relevant. 
The subgroups of items were not formed 
on the basis of an analysis of the 
teachers' ratings (as had originally
been intended) but rather because of 
judged similarity of content. There 
are also informal instances of reports 
by observers of students commenting 
to the teachers that at least some of 
the material was familiar.

A second interpretation of the re- 
sults assumes that the ratings do re- 
flect accurately the content emphasized 
by the teachers. This assumption is at
least consistent with the tentative
evidence just cited. Again, given that they
are motivated to be judged effective in 
their work, this interpretation asserts 
that teachers find it tempting to teach 
available skills, knowing consciously 
or unconsciously that their students will 
then appear to be performing well, espe- 
cially when they are being observed by 
outside evaluators. This explanation 
would account for the differences between 
the correlations in Table 3 for experi- 
mental and comparison groups in the sense 
that the experimental teachers were un- 
doubtedly under greater internal pressure 
to succeed, since they were "master" 
teachers participating in an experimental 
program with high visibility. Not only 
behavioral scientists but also a variety 
of educators were observing their
classrooms and materials. Except for the
achievement testing, comparison teachers
were in a very much more typical
situation with regard to visibility.

The above explanation is plausible, 
but there is an additional interesting 
possibility. The instruction of urban,
minority children who are not achieving
as well as their own peers is
likely to be very hard work for teach- 
ers. Moreover, experimental teachers 
participated in sensitivity training 
sessions in which it was stressed that 
such children are likely to associate 
academic aspects of school with a 
sense of personal failure and inad- 
equacy. In effect, experimental 
teachers were thus being urged not to 
give the children in the program fur- 
ther experiences with failure. These 
two conditions would also account for 
the fact that both experimental and 
comparison teachers apparently direc- 
ted instruction at skills already 
available (it was an easier alterna- 
tive than trying to teach new content), 
as well as the fact that experimental 
teachers did so to a greater degree 
in order to avoid confronting their 
students with further failure
experiences.

As plausible as the two explana- 
tions may be for the conclusion that 
the relevancy ratings reflected in- 
structional content accurately (and 
both could be operating at the same 
time), one could still argue that the 
results do not make sense because 
learning would not go on at all in 
the schools if instructional objec- 
tives were confined to what students 
already knew. In reply it can be 
noted that while the above correla- 
tions are not perfect relationships 
and do not completely exclude the 
possibility of some new material 
being introduced into the curricula 
studied, as it undoubtedly was, this 
report does not deal with students 
from the affluent middle classes, 
but with urban minority students who 
are already far behind in achievement 
and who, if Coleman's (1966) findings
apply, will fall further behind with time.
The present results are quite consistent 
with this well-documented phenomenon.

Thus, it seems reasonable to conclude that 
the frustrations encountered in teaching 
educationally handicapped students as well 
as the need perceived by teachers to pro- 
vide such students with experiences of 
success and also the teachers' own needs 
to perform well, especially under condi- 
tions of close observation, may well lead 
teachers to make the task of instruction 
easier by emphasizing those areas of con- 
tent in which present capabilities of stu- 
dents are relatively more developed. 



Implications 

Of the two major implications of these 
findings, the first relates to instructional 
practice and the second to the methodology 
of evaluation. With regard to the former, 
it is readily apparent that teachers do 
need to be informed about the entry skills 
of their students as related to the objec- 
tives of a course of instruction, because 
without such information there is undue latitude
for the operation of other irrelevant fac- 
tors in decisions about curriculum. Such 
information on entry skills will be most 
useful if referred to specific, unequivocal 
objectives such as those described by 
Popham (1969). The importance of obtaining
information on entry skills has, of course, 
been stressed by others, including Glaser 
(1967), in the development of individualized 
instructional curricula.

Secondly, it is clear in the present 
case that data on entry skills could have 
been most useful to the teachers develop- 
ing the program had they been made avail- 
able early in the research. This failure 
to meet the needs of program developers is 
illustrative of an all too common phenom- 
enon in evaluation. There is a widespread 
tendency for researchers engaged in
the evaluation of instruction to 
concentrate on collecting data rele- 
vant to program adoption at the ex- 
pense of data relevant to program 
development. That is, behavioral
scientists typically approach eval- 
uation of educational practices 
with the analogue of the experiment 
firmly in mind. This leads to un- 
due concern with answering the 
question, "Is the new program bet- 
ter than the old?", and results in 
neglect of the more important task 
of helping program developers make 
certain the answer will turn out 
to be in the affirmative. Unlike
the experimenter in a controlled
laboratory situation, the educational
researcher in the role of evaluator
may quite appropriately produce data
that will lead to modifications in
"treatment" variables while the
research is going on. As illustrated
in the present case, if the evalua- 
tor does not seek out systematic in- 
formation relevant to program 
development, it is not likely that 
others will. This need has already 
been pointed out by Cronbach (1963), 
Stufflebeam (1968), and by others. 
Certainly, the present research pro- 
vides strong support for the asser- 
tion that all who are involved in 
the development and evaluation of 
programs of instruction should mon- 
itor the entry skills of the target 
population.